
🚀 Getting Started¶

This project analyzes a healthcare dataset with the aim of predicting the prognosis of several diseases. Each entry represents a unique patient, and the features describe health and lifestyle factors associated with disease prognosis: age, sex, general health, checkup frequency, exercise habits, smoking history, and the presence of specific diseases. 💊💡

🔧 Tools and Libraries¶

We will be using Python for this project, along with several libraries for data analysis, machine learning, and data visualization. Here are the main libraries we'll be using:¶

Pandas and NumPy: For data manipulation and analysis.

Matplotlib and Seaborn: For data visualization.¶

📚 Dataset¶

The dataset we'll be using includes features related to patients' health and lifestyle. Each row represents a unique patient and includes attributes such as age category, sex, height, weight, BMI, alcohol consumption, smoking history, and dietary habits. 📊💉

🎯 Objectives¶

Our main objective is to perform Exploratory Data Analysis (EDA) to understand how the data is distributed and how the variables relate to one another, and to present those relationships visually for better understanding. We will also check whether the data has any duplicates or null values and perform data cleaning where needed.
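
As a rough illustration of the duplicate and null checks mentioned in the objectives, here is a minimal sketch on a tiny made-up frame. The name `toy_df` and its values are assumptions for illustration only and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (not the real data): one repeated row, one missing value.
toy_df = pd.DataFrame({
    "Sex": ["Female", "Male", "Male", "Female"],
    "BMI": [14.54, 28.73, 28.73, np.nan],
})

n_duplicates = int(toy_df.duplicated().sum())   # rows identical to an earlier row
nulls_per_column = toy_df.isnull().sum()        # missing values per column

# Drop repeated rows first, then rows that still contain a missing value.
cleaned_df = toy_df.drop_duplicates().dropna()
print(n_duplicates, nulls_per_column.to_dict(), len(cleaned_df))
```

Running both checks before any plotting makes it easier to tell whether later counts are inflated by repeated rows.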

📈 Workflow¶

Here's a brief overview of our workflow for this project:¶

Data Loading and Preprocessing:¶

Load the data and preprocess it for analysis and modeling. This includes handling missing values, converting categorical variables into dummy/indicator variables, and encoding ordinal variables. 📊🔍🧹

Exploratory Data Analysis (EDA):¶

Perform exploratory data analysis to gain insights into the dataset, understand the distributions of features, and explore potential relationships between the features and the disease outcomes. 📊🔬📉

Data Cleansing:¶

Perform data cleansing and transformation to improve the model's performance. This includes imputing missing values and normalizing numeric features. 💡🔬
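
The preprocessing and cleansing steps described in this workflow (ordinal encoding, indicator variables, imputation, normalization) could look roughly like the sketch below. It runs on a made-up frame; `toy_df`, the choice of median imputation, and min-max scaling are assumptions for illustration, not necessarily what this notebook does later.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy_df = pd.DataFrame({
    "General_Health": ["Poor", "Good", "Excellent", "Fair"],
    "Sex": ["Female", "Male", "Male", "Female"],
    "BMI": [20.0, np.nan, 30.0, 40.0],
})

# Ordinal variable: spell out the order so the integer codes respect it.
health_order = ["Poor", "Fair", "Good", "Very Good", "Excellent"]
toy_df["General_Health"] = pd.Categorical(
    toy_df["General_Health"], categories=health_order, ordered=True
).codes

# Nominal variable: one indicator column per category.
toy_df = pd.get_dummies(toy_df, columns=["Sex"], dtype=int)

# Impute the missing numeric value with the column median, then min-max scale to [0, 1].
toy_df["BMI"] = toy_df["BMI"].fillna(toy_df["BMI"].median())
bmi_min, bmi_max = toy_df["BMI"].min(), toy_df["BMI"].max()
toy_df["BMI_scaled"] = (toy_df["BMI"] - bmi_min) / (bmi_max - bmi_min)
print(toy_df)
```

Declaring the category order explicitly keeps the integer codes aligned with the meaning of the labels, which matters once correlations are computed on the encoded columns.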

Domain Knowledge 📚¶

Age:¶

This is the age of the patient. Age is a crucial factor in disease prognosis as the risk of chronic conditions such as heart disease, cancer, diabetes, and arthritis increases with age. This is due to various factors including the cumulative effect of exposure to risk factors, increased wear and tear on the body, and changes in the body's physiological functions. 🌡️👴

Sex:¶

This feature represents the gender of the patient. Gender can influence disease prognosis due to biological differences and gender-specific lifestyle patterns. For instance, heart disease is more common in males, while skin cancer is more common in females. This could be due to factors like longer life expectancy or different exposure to risk factors in each gender. ♀️♂️

General_Health:¶

This is a self-rated health status of the patient. Patients who perceive their health as "Poor" or "Fair" are more likely to have chronic conditions. This could be because the symptoms or management of these conditions impact their perceived health status. 💓

Checkup:¶

This feature represents the frequency of health checkups. Regular health checkups can help in early detection and management of diseases, thereby improving the prognosis. 🏥

Exercise:¶

This feature indicates whether the patient exercises regularly or not. Regular exercise can help control weight, reduce risk of heart diseases, and manage blood sugar and insulin levels, among other benefits. This aligns with the negative correlation observed between exercise and diseases such as heart disease, diabetes, and arthritis. 🏃‍♂️🏋️‍♀️

Smoking_History:¶

This feature indicates whether the patient has a history of smoking. Smoking can increase disease risk as it can damage blood vessels, increase blood pressure, and reduce the amount of oxygen reaching the organs. 🚬🚭

|IMPORTING NECESSARY LIBRARIES|¶

In [27]:
import warnings 
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
In [28]:
# Load the Dataset
cardio_df=pd.read_csv(r'C:\Hero Vired practice\Project\CVD_cleaned.csv',skipinitialspace=True)
In [29]:
cardio_df.head()
Out[29]:
General_Health Checkup Exercise Heart_Disease Skin_Cancer Other_Cancer Depression Diabetes Arthritis Sex Age_Category Height_(cm) Weight_(kg) BMI Smoking_History Alcohol_Consumption Fruit_Consumption Green_Vegetables_Consumption FriedPotato_Consumption
0 Poor Within the past 2 years No No No No No No Yes Female 70-74 150.0 32.66 14.54 Yes 0.0 30.0 16.0 12.0
1 Very Good Within the past year No Yes No No No Yes No Female 70-74 165.0 77.11 28.29 No 0.0 30.0 0.0 4.0
2 Very Good Within the past year Yes No No No No Yes No Female 60-64 163.0 88.45 33.47 No 4.0 12.0 3.0 16.0
3 Poor Within the past year Yes Yes No No No Yes No Male 75-79 180.0 93.44 28.73 No 0.0 30.0 30.0 8.0
4 Good Within the past year No No No No No No No Male 80+ 191.0 88.45 24.37 Yes 0.0 8.0 4.0 0.0


Diving into Disease Prognosis¶

This project puts a spotlight on the healthcare dataset, aiming to identify key patterns and factors associated with various diseases. Using univariate, bivariate, and multivariate analysis, we aim to uncover the relationships and significant determinants in the dataset.

For the exploratory data analysis (EDA), we will proceed with the following steps:¶

Univariate Analysis:¶

We'll inspect each variable individually to understand its distribution and potential outliers. This will provide insights into the characteristics of each variable and help identify any extreme values or anomalies.

Bivariate Analysis:¶

We'll explore the relationship between each variable and the target variables (Heart_Disease, Skin_Cancer, Other_Cancer, Diabetes, Arthritis). This analysis will allow us to understand how each variable is associated with the presence or absence of these diseases. We can use techniques like bar charts to visualize the distributions of the target variables across the categories or levels of other variables.

Multivariate Analysis:¶

We'll study the interactions between different variables and how they collectively relate to the target variables. This analysis will help us uncover complex relationships and patterns that may not be apparent in the univariate or bivariate analyses. Techniques such as scatter plots, correlation matrices, and 3D visualizations can be utilized to gain deeper insights into the data.

We'll start with the target variables, and then move on to the other variables. Since the target variables are binary, we can use bar charts to visualize their distributions. 📊🔍¶


| NUMERICAL VARIABLES|¶

In [30]:
# Smoking_History is a Yes/No column, so it is plotted with the categorical features instead
Numerical_Features=['Height_(cm)','Weight_(kg)','BMI','Alcohol_Consumption','Fruit_Consumption','Green_Vegetables_Consumption','FriedPotato_Consumption']

for i in Numerical_Features:
    plt.figure(figsize=(10,4))
    sns.histplot(data=cardio_df,x=i)
    plt.title('Distribution of '+ i)
    plt.show()

🔍 Interpretation of Results:¶

📏 Height_(cm):¶

The height of the patients appears to follow an approximately normal distribution, with the majority of patients between 160 and 180 cm.

⚖️ Weight_(kg):¶

The weight of the patients also appears to be normally distributed, with most patients weighing between approximately 60 and 100 kg.

📏⚖️ BMI:¶

The BMI of the patients is slightly right-skewed. A large number of patients have a BMI between 20 and 30, which spans the normal to overweight range. However, there are also a significant number of patients with a BMI in the obese range (>30).

🍺 Alcohol_Consumption:¶

This feature is heavily right-skewed. Most patients have low alcohol consumption, but a few patients have high consumption.

🍎 Fruit_Consumption:¶

This feature is also right-skewed. Many patients consume fruit regularly, but a significant number consume it less frequently.

🥦 Green_Vegetables_Consumption:¶

This feature appears to be normally distributed, with most patients consuming green vegetables moderately.

🍟 FriedPotato_Consumption:¶

This feature is right-skewed. Many patients consume fried potatoes less frequently, while a few consume them more often.
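
The visual impressions above can be backed with a number. The sketch below computes sample skewness with pandas on made-up values; `toy_df`, its values, and the cut-off of 1 for "clearly right-skewed" are assumptions (a common rule of thumb, not something this notebook defines).

```python
import pandas as pd

# Hypothetical values chosen only to illustrate the idea; not taken from the dataset.
toy_df = pd.DataFrame({
    "Height_(cm)": [160.0, 165.0, 170.0, 175.0, 180.0],   # symmetric around 170
    "Alcohol_Consumption": [0.0, 0.0, 0.0, 2.0, 30.0],    # long right tail
})

skewness = toy_df.skew()   # sample skewness per column
right_skewed = skewness[skewness > 1].index.tolist()
print(skewness.round(2))
print(right_skewed)
```

A value near zero suggests a roughly symmetric distribution, while a clearly positive value matches the long right tails seen in the consumption histograms.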

|Categorical Values|¶

In [31]:
# Check the distribution of categorical features
categorical_features = ['General_Health', 'Checkup', 'Exercise', 'Heart_Disease', 'Skin_Cancer', 'Other_Cancer', 'Depression', 'Diabetes', 'Arthritis', 'Sex', 'Age_Category', 'Smoking_History']

for feature in categorical_features:
    plt.figure(figsize=(10, 4))
    sns.countplot(data=cardio_df, x=feature)
    plt.title('Count of ' + feature)
    plt.xticks(rotation=90)
    plt.show()

🔍 Interpretation of Results:¶

😊 General_Health:¶

Most patients describe their general health as "Good", with "Very Good" being the second most common response. Fewer patients rate their health as "Fair" or "Poor".

🩺 Checkup:¶

The majority of patients had a checkup within the past year. Fewer patients had their last checkup within the past 2 years, within the past 5 years, or 5 or more years ago.

🏋️‍♀️ Exercise:¶

More patients reported that they exercise compared to those who do not.

❤️ Heart_Disease:¶

A significant majority of patients do not have heart disease. Only a small proportion of patients have heart disease.

🌞 Skin_Cancer:¶

The vast majority of patients do not have skin cancer.

🦀 Other_Cancer:¶

Similar to skin cancer, most patients do not have other forms of cancer.

😔 Depression:¶

Most patients do not suffer from depression. However, a non-trivial number of patients do report having depression.

🩸 Diabetes:¶

Similar to the disease-related features above, most patients do not have diabetes. However, a small proportion do have diabetes.

💪 Arthritis:¶

Most patients do not have arthritis, but a significant number do.

♀️♂️ Sex:¶

There are slightly more female patients than male patients in the dataset.

🗓️ Age_Category:¶

The dataset includes patients from a wide range of age categories. The 65-69 age category has the most patients, followed by the 60-64 and 70-74 categories.
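
Statements such as "most patients" or "a small proportion" are easier to check when the counts are shown next to their shares. A minimal sketch, using a made-up column (the values in `toy_df` are illustrative assumptions only):

```python
import pandas as pd

# Ten made-up answers for illustration.
toy_df = pd.DataFrame({
    "Heart_Disease": ["No", "No", "No", "Yes", "No", "No", "No", "No", "Yes", "No"],
})

counts = toy_df["Heart_Disease"].value_counts()                 # absolute counts
shares = toy_df["Heart_Disease"].value_counts(normalize=True)   # fractions summing to 1
print(pd.DataFrame({"count": counts, "share": shares}))
```

The same two calls can be looped over the list of categorical features to tabulate what the count plots show.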


In [32]:
selected_variables = ['General_Health', 'Exercise', 'Sex', 'Age_Category', 'Smoking_History']

disease_conditions = ['Heart_Disease', 'Skin_Cancer', 'Other_Cancer', 'Diabetes', 'Arthritis']

for disease in disease_conditions:
    for variable in selected_variables:
        plt.figure(figsize=(10, 4))
        sns.countplot(data=cardio_df, x=variable, hue=disease)
        plt.title('Relationship between ' + variable + ' and ' + disease)
        plt.xticks(rotation=90)
        plt.show()

🔍 Interpretation of Results:¶

❤️ Heart_Disease:¶

Heart disease is more prevalent in patients who rate their general health as "Poor" or "Fair". 🩹🔻

It is slightly more common in patients who do not exercise. 🏋️‍♂️❌¶

Males are more likely to have heart disease than females. 👨‍⚕️>👩‍⚕️¶

The prevalence of heart disease increases with age, with it being most common in the 80+ age category. 🧓🔝¶

Heart disease is also more common in patients with a history of smoking. 🚬🔺¶

------------------------------------------------------------------------------------------------¶

🌞 Skin_Cancer:¶

Skin cancer is more prevalent in patients who rate their general health as "Good" or "Very Good".👍🔻¶

There is not much difference in prevalence based on exercise habits. 🏃‍♂️⏸️¶

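Females appear to account for more skin cancer cases in the count plot, but the dataset also contains slightly more females, and the correlation between Sex and Skin_Cancer in the correlation matrix is close to zero (0.01), so prevalence differs very little by sex.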

The prevalence of skin cancer increases with age, with it being most common in the 70-74 age category. 👵🔝¶

There is not much difference in prevalence based on smoking history. 🚬⏸️¶

------------------------------------------------------------------------------------------------¶

🦀 Other_Cancer:¶

Other cancers are more prevalent in patients who rate their general health as "Poor" or "Fair".🩹🔻¶

They are slightly more common in patients who do not exercise. 🏋️‍♂️❌

There is not much difference in prevalence based on sex. 👫⏸️¶

The prevalence of other cancers increases with age, with it being most common in the 75-79 age category. 👵🔝¶

Other cancers are more common in patients with a history of smoking. 🚬🔺¶

------------------------------------------------------------------------------------------------¶

🩸 Diabetes:¶

Diabetes is more prevalent in patients who rate their general health as "Fair" or "Poor".🩹🔻¶

It is more common in patients who do not exercise. 🏋️‍♂️❌¶

There is not much difference in prevalence based on sex. 👫⏸️¶

The prevalence of diabetes increases with age, with it being most common in the 70-74 age category. 👵🔝¶

Diabetes is more common in patients with a history of smoking. 🚬🔺¶

------------------------------------------------------------------------------------------------¶

💪 Arthritis:¶

Arthritis is more prevalent in patients who rate their general health as "Fair" or "Poor".🩹🔻¶

It is slightly more common in patients who do not exercise. 🏋️‍♂️❌¶

Females are more likely to have arthritis than males. 👩‍⚕️>👨‍⚕️¶

The prevalence of arthritis increases with age, with it being most common in the 75-79 age category. 👵🔝¶

Arthritis is slightly more common in patients with a history of smoking. 🚬🔺¶
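
A count plot compares raw counts, so a larger group can show more cases even when its rate is lower. One way to compare prevalence directly is a row-normalized cross-tabulation. The sketch below uses made-up rows (`toy_df` is an illustrative assumption) in which both exercise groups have the same number of cases but very different rates.

```python
import pandas as pd

# Made-up rows: six patients who exercise, two who do not.
toy_df = pd.DataFrame({
    "Exercise":      ["Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"],
    "Heart_Disease": ["No",  "No",  "No",  "No",  "No",  "Yes", "No", "Yes"],
})

# Raw counts, which is what a count plot with hue displays.
counts = pd.crosstab(toy_df["Exercise"], toy_df["Heart_Disease"])

# Row-normalized table: share of each Exercise group with and without the condition.
rates = pd.crosstab(toy_df["Exercise"], toy_df["Heart_Disease"], normalize="index")
print(counts)
print(rates.round(3))
```

Here each group has one case, yet the rate is 50% among non-exercisers and about 17% among exercisers, which is the comparison the prevalence statements above rely on.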

[Image: Correlation matrix]

|DATA PREPROCESSING|¶

In [33]:
cardio_df.head()
Out[33]:
General_Health Checkup Exercise Heart_Disease Skin_Cancer Other_Cancer Depression Diabetes Arthritis Sex Age_Category Height_(cm) Weight_(kg) BMI Smoking_History Alcohol_Consumption Fruit_Consumption Green_Vegetables_Consumption FriedPotato_Consumption
0 Poor Within the past 2 years No No No No No No Yes Female 70-74 150.0 32.66 14.54 Yes 0.0 30.0 16.0 12.0
1 Very Good Within the past year No Yes No No No Yes No Female 70-74 165.0 77.11 28.29 No 0.0 30.0 0.0 4.0
2 Very Good Within the past year Yes No No No No Yes No Female 60-64 163.0 88.45 33.47 No 4.0 12.0 3.0 16.0
3 Poor Within the past year Yes Yes No No No Yes No Male 75-79 180.0 93.44 28.73 No 0.0 30.0 30.0 8.0
4 Good Within the past year No No No No No No No Male 80+ 191.0 88.45 24.37 Yes 0.0 8.0 4.0 0.0
In [34]:
cardio_df['General_Health'].unique()
Out[34]:
array(['Poor', 'Very Good', 'Good', 'Fair', 'Excellent'], dtype=object)
In [35]:
# Ordinal encoding for self-rated health: Poor (0) to Excellent (4)
General_Health_mapping={'Poor':0,'Fair':1,'Good':2,'Very Good':3,'Excellent':4}
cardio_df['General_Health']=cardio_df['General_Health'].map(General_Health_mapping)

# NOTE: these Checkup codes do not follow chronological order, so the column
# behaves like a nominal label and its correlations should be read with caution.
checkup_mapping={'Within the past 2 years': 0, 'Within the past year':1,
       '5 or more years ago': 2, 'Within the past 5 years' : 3 , 'Never': 4}
cardio_df['Checkup']=cardio_df['Checkup'].map(checkup_mapping)

# Binary encoding for sex: Female = 0, Male = 1
Gender_mapping={'Female':0,'Male':1}
cardio_df['Sex']=cardio_df['Sex'].map(Gender_mapping)

# Ordinal encoding for age bands: 18-24 (0) to 80+ (12)
Age_category_mapping={'18-24':0, '25-29':1, '30-34':2, '35-39':3, '40-44':4, '45-49':5, '50-54':6,
       '55-59':7, '60-64':8, '65-69':9, '70-74':10, '75-79':11, '80+':12}
cardio_df['Age_Category']=cardio_df['Age_Category'].map(Age_category_mapping)

# Yes/No columns become 1/0; .map() turns any other label into NaN
binary_columns=['Exercise','Heart_Disease','Skin_Cancer','Other_Cancer','Depression','Diabetes','Arthritis','Smoking_History']
for i in binary_columns:
    cardio_df[i]=cardio_df[i].map({'Yes':1 ,'No': 0})
In [36]:
cardio_df.head()
Out[36]:
General_Health Checkup Exercise Heart_Disease Skin_Cancer Other_Cancer Depression Diabetes Arthritis Sex Age_Category Height_(cm) Weight_(kg) BMI Smoking_History Alcohol_Consumption Fruit_Consumption Green_Vegetables_Consumption FriedPotato_Consumption
0 0 0 0 0 0 0 0 0.0 1 0 10 150.0 32.66 14.54 1 0.0 30.0 16.0 12.0
1 3 1 0 1 0 0 0 1.0 0 0 10 165.0 77.11 28.29 0 0.0 30.0 0.0 4.0
2 3 1 1 0 0 0 0 1.0 0 0 8 163.0 88.45 33.47 0 4.0 12.0 3.0 16.0
3 0 1 1 1 0 0 0 1.0 0 1 11 180.0 93.44 28.73 0 0.0 30.0 30.0 8.0
4 2 1 0 0 0 0 0 0.0 0 1 12 191.0 88.45 24.37 1 0.0 8.0 4.0 0.0
In [37]:
cardio_df=cardio_df.drop_duplicates()
In [38]:
data=cardio_df.corr()
In [39]:
data
Out[39]:
General_Health Checkup Exercise Heart_Disease Skin_Cancer Other_Cancer Depression Diabetes Arthritis Sex Age_Category Height_(cm) Weight_(kg) BMI Smoking_History Alcohol_Consumption Fruit_Consumption Green_Vegetables_Consumption FriedPotato_Consumption
General_Health 1.000000 0.017505 0.276080 -0.232484 -0.047079 -0.145614 -0.207533 -0.278653 -0.265911 0.018939 -0.167350 0.066930 -0.184197 -0.246444 -0.167538 0.118333 0.102602 0.119738 -0.031816
Checkup 0.017505 1.000000 -0.000230 -0.022603 -0.031536 -0.029821 -0.017365 -0.037276 -0.057399 0.062261 -0.087853 0.049326 0.006746 -0.018371 0.020790 0.014921 -0.028077 -0.024201 0.029969
Exercise 0.276080 -0.000230 1.000000 -0.096321 -0.003963 -0.054363 -0.084673 -0.146645 -0.124785 0.059355 -0.122334 0.091622 -0.090121 -0.155732 -0.093241 0.095028 0.136782 0.124983 -0.036904
Heart_Disease -0.232484 -0.022603 -0.096321 1.000000 0.090835 0.092369 0.032494 0.185341 0.153891 0.072606 0.229027 0.015783 0.045854 0.042642 0.107757 -0.036614 -0.020045 -0.024027 -0.009249
Skin_Cancer -0.047079 -0.031536 -0.003963 0.090835 1.000000 0.150781 -0.013041 0.039286 0.136146 0.009658 0.272075 0.006799 -0.028986 -0.037647 0.032793 0.042734 0.024143 0.012894 -0.038945
Other_Cancer -0.145614 -0.029821 -0.054363 0.092369 0.150781 1.000000 0.015861 0.072281 0.129320 -0.042061 0.234464 -0.043476 -0.021169 0.001015 0.053390 -0.008704 0.007992 -0.003215 -0.033326
Depression -0.207533 -0.017365 -0.084673 0.032494 -0.013041 0.015861 1.000000 0.047341 0.121562 -0.141457 -0.103195 -0.091315 0.047904 0.109557 0.100215 -0.028200 -0.039938 -0.051134 0.018108
Diabetes -0.278653 -0.037276 -0.146645 0.185341 0.039286 0.072281 0.047341 1.000000 0.144823 0.020362 0.216574 -0.017735 0.174925 0.210490 0.059028 -0.116204 -0.022296 -0.032131 -0.003678
Arthritis -0.265911 -0.057399 -0.124785 0.153891 0.136146 0.129320 0.121562 0.144823 1.000000 -0.100047 0.370996 -0.097794 0.074068 0.137924 0.123128 -0.024968 -0.001983 -0.018803 -0.050994
Sex 0.018939 0.062261 0.059355 0.072606 0.009658 -0.042061 -0.141457 0.020362 -0.100047 1.000000 -0.060234 0.698129 0.353989 0.010978 0.073407 0.129311 -0.092486 -0.069169 0.130049
Age_Category -0.167350 -0.087853 -0.122334 0.229027 0.272075 0.234464 -0.103195 0.216574 0.370996 -0.060234 1.000000 -0.120922 -0.062308 -0.007426 0.133155 0.012833 0.043661 0.036030 -0.142761
Height_(cm) 0.066930 0.049326 0.091622 0.015783 0.006799 -0.043476 -0.091315 -0.017735 -0.097794 0.698129 -0.120922 1.000000 0.472175 -0.027413 0.051762 0.128850 -0.045925 -0.030153 0.108790
Weight_(kg) -0.184197 0.006746 -0.090121 0.045854 -0.028986 -0.021169 0.047904 0.174925 0.074068 0.353989 -0.062308 0.472175 1.000000 0.859702 0.047481 -0.032427 -0.090611 -0.075895 0.096327
BMI -0.246444 -0.018371 -0.155732 0.042642 -0.037647 0.001015 0.109557 0.210490 0.137924 0.010978 -0.007426 -0.027413 0.859702 1.000000 0.024794 -0.108750 -0.076603 -0.070629 0.048343
Smoking_History -0.167538 0.020790 -0.093241 0.107757 0.032793 0.053390 0.100215 0.059028 0.123128 0.073407 0.133155 0.051762 0.047481 0.024794 1.000000 0.100553 -0.093626 -0.034371 0.035824
Alcohol_Consumption 0.118333 0.014921 0.095028 -0.036614 0.042734 -0.008704 -0.028200 -0.116204 -0.024968 0.129311 0.012833 0.128850 -0.032427 -0.108750 0.100553 1.000000 -0.012542 0.060088 0.020503
Fruit_Consumption 0.102602 -0.028077 0.136782 -0.020045 0.024143 0.007992 -0.039938 -0.022296 -0.001983 -0.092486 0.043661 -0.045925 -0.090611 -0.076603 -0.093626 -0.012542 1.000000 0.270426 -0.060302
Green_Vegetables_Consumption 0.119738 -0.024201 0.124983 -0.024027 0.012894 -0.003215 -0.051134 -0.032131 -0.018803 -0.069169 0.036030 -0.030153 -0.075895 -0.070629 -0.034371 0.060088 0.270426 1.000000 0.003209
FriedPotato_Consumption -0.031816 0.029969 -0.036904 -0.009249 -0.038945 -0.033326 0.018108 -0.003678 -0.050994 0.130049 -0.142761 0.108790 0.096327 0.048343 0.035824 0.020503 -0.060302 0.003209 1.000000
In [40]:
a=data['General_Health']
b=pd.DataFrame(a)
b
Out[40]:
General_Health
General_Health 1.000000
Checkup 0.017505
Exercise 0.276080
Heart_Disease -0.232484
Skin_Cancer -0.047079
Other_Cancer -0.145614
Depression -0.207533
Diabetes -0.278653
Arthritis -0.265911
Sex 0.018939
Age_Category -0.167350
Height_(cm) 0.066930
Weight_(kg) -0.184197
BMI -0.246444
Smoking_History -0.167538
Alcohol_Consumption 0.118333
Fruit_Consumption 0.102602
Green_Vegetables_Consumption 0.119738
FriedPotato_Consumption -0.031816
In [41]:
sns.heatmap(data,cmap="coolwarm")
Out[41]:
<Axes: >
In [42]:
# disease_variables = ['Heart_Disease', 'Skin_Cancer', 'Other_Cancer', 'Diabetes']


a=data['Heart_Disease']
b=pd.DataFrame(a)
sns.heatmap(b,annot=True,cmap='coolwarm')
plt.title('Correlation with Heart Disease')
Out[42]:
Text(0.5, 1.0, 'Correlation with Heart Disease')
In [43]:
a=data['Skin_Cancer']
b=pd.DataFrame(a)
sns.heatmap(b,annot=True,cmap='coolwarm')
plt.title('Correlation with Skin Cancer')
Out[43]:
Text(0.5, 1.0, 'Correlation with Skin Cancer')
In [44]:
a=data['Other_Cancer']
b=pd.DataFrame(a)
sns.heatmap(b,annot=True,cmap='coolwarm')
plt.title('Correlation with Other Cancer')
Out[44]:
Text(0.5, 1.0, 'Correlation with Other Cancer')
In [45]:
a=data['Diabetes']
b=pd.DataFrame(a)
sns.heatmap(b,annot=True,cmap='coolwarm')
plt.title('Correlation with Diabetes')
Out[45]:
Text(0.5, 1.0, 'Correlation with Diabetes')

🔍 Interpretation of Results:¶

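📊 The correlation heatmaps show the correlation of each feature with four disease variables: Heart_Disease, Skin_Cancer, Other_Cancer, and Diabetes. General_Health is encoded from Poor (0) to Excellent (4) and Sex as Female (0) / Male (1), so a negative correlation with General_Health means the condition is more common among patients with poorer self-rated health.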

🔎 From the heatmaps, we can observe the following:¶

❤️ Heart_Disease:¶

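Heart_Disease is most strongly associated with General_Health (-0.23; since health is encoded from Poor = 0 to Excellent = 4, poorer self-rated health goes with more heart disease) and Age_Category (0.23). It is also positively correlated with Diabetes (0.19), Arthritis (0.15), and Smoking_History (0.11), and weakly negatively correlated with Exercise (-0.10). Because Sex is encoded as Female = 0 and Male = 1, the small positive value (0.07) points to slightly higher prevalence among males.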

🌞 Skin_Cancer:¶

Skin_Cancer is most strongly correlated with Age_Category (0.27), followed by Other_Cancer (0.15) and Arthritis (0.14). Its correlation with Sex is close to zero (0.01), and its link to General_Health is weak (-0.05).

🦀 Other_Cancer:¶

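Other_Cancer is most strongly correlated with Age_Category (0.23), followed by Skin_Cancer (0.15) and Arthritis (0.13). Its correlation with General_Health is negative (-0.15), meaning it is more common among patients with poorer self-rated health, and its correlation with Sex is close to zero (-0.04).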

🩸 Diabetes:¶

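Diabetes is most strongly associated with General_Health (-0.28; poorer self-rated health goes with more diabetes), followed by positive correlations with Age_Category (0.22), BMI (0.21), Heart_Disease (0.19), and Weight_(kg) (0.17). It is negatively correlated with Exercise (-0.15) and Alcohol_Consumption (-0.12).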
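
Reading signs and magnitudes off a heatmap is error-prone, so it can help to rank the correlations numerically. The sketch below uses a handful of Heart_Disease entries copied from the correlation matrix output (Out[39]); the variable names `heart_corr` and `ranked` are illustrative. With the full matrix, the equivalent starting point would be `data['Heart_Disease'].drop('Heart_Disease')`.

```python
import pandas as pd

# A few entries of the Heart_Disease column, copied from the correlation matrix output.
heart_corr = pd.Series({
    "General_Health": -0.232484,
    "Exercise": -0.096321,
    "Diabetes": 0.185341,
    "Arthritis": 0.153891,
    "Sex": 0.072606,
    "Age_Category": 0.229027,
    "BMI": 0.042642,
    "Smoking_History": 0.107757,
})

# Order by absolute strength while keeping the sign visible.
ranked = heart_corr.reindex(heart_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Sorting by absolute value keeps strong negative relationships, such as the one with General_Health, from being overlooked.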

[Image: EDA Summary]

|Univariate Analysis: 📊🧪¶

The distributions of Height_(cm) and Weight_(kg) were approximately normal and BMI was slightly right-skewed, while Alcohol_Consumption, Fruit_Consumption, and FriedPotato_Consumption showed clearly right-skewed distributions. This suggests that a large proportion of patients have low to moderate consumption levels. 🥦🥔🍷

The categorical variables displayed diverse distributions. For instance, most patients rated their general health as "Good" or "Very Good" and had their last checkup within the past year. Moreover, most patients reported exercising regularly, and a majority did not have a history of smoking. 🏥🚭💪¶

|Bivariate Analysis: 📈👥¶

The bivariate analysis revealed relationships between selected features and the disease conditions. Diseases like ❤️ Heart_Disease, 🦀 Other_Cancer, 🩸 Diabetes, and 💪 Arthritis were more prevalent in patients who rated their general health as "Poor" or "Fair", did not exercise, and had a history of smoking. 🌞 Skin_Cancer showed a different pattern, being more prevalent in patients with "Good" or "Very Good" general health and not showing a significant difference based on exercise habits. 🚶‍♀️🚶‍♂️¶

|Multivariate Analysis: 📊🔢¶

The multivariate analysis showed the interplay between multiple variables. For instance, as age increased, the proportion of individuals rating their health as "Good" or "Very Good" decreased, while the proportion rating their health as "Fair" or "Poor" increased. Similarly, individuals who exercised had a higher proportion of "Normal" BMI, while those who did not exercise had a higher proportion of "Overweight" and "Obese" BMI. 👴👵🏋️‍♂️🏋️‍♀️¶
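
One simple way to look at how two factors jointly relate to a condition is a pivot table of rates. The sketch below is an assumption about how such a view could be built, not the method used for the statements above; `toy_df` and its values are made up for illustration.

```python
import pandas as pd

# Made-up rows: two age bands crossed with exercise status; Heart_Disease is 0/1.
toy_df = pd.DataFrame({
    "Age_Category":  ["18-24", "18-24", "18-24", "18-24", "80+", "80+", "80+", "80+"],
    "Exercise":      ["Yes",   "Yes",   "No",    "No",    "Yes", "Yes", "No",  "No"],
    "Heart_Disease": [0,       0,       0,       0,       0,     1,     1,     1],
})

# Mean of a 0/1 column is the share of patients with the condition in each cell.
rate_table = toy_df.pivot_table(
    index="Age_Category", columns="Exercise", values="Heart_Disease", aggfunc="mean"
)
print(rate_table)
```

Each cell holds the prevalence for one combination of age band and exercise status, so the effect of one factor can be read while holding the other fixed.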

|Correlation Analysis: 🧩🔍¶

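Finally, the correlation analysis revealed the strength and direction of the relationships between the features and the disease conditions. Age_Category showed a weak-to-moderate positive correlation with Heart_Disease (0.23), Skin_Cancer (0.27), Other_Cancer (0.23), Diabetes (0.22), and Arthritis (0.37), indicating that these conditions become more common with age. Exercise showed a weak negative correlation with the same conditions (between -0.15 and roughly zero), which is consistent with regular exercise being associated with lower risk, although correlation alone does not establish cause. 💪⏳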


| REMOVING NULL VALUES

In [46]:
cardio_df.isnull().sum()
Out[46]:
General_Health                     0
Checkup                            0
Exercise                           0
Heart_Disease                      0
Skin_Cancer                        0
Other_Cancer                       0
Depression                         0
Diabetes                        9542
Arthritis                          0
Sex                                0
Age_Category                       0
Height_(cm)                        0
Weight_(kg)                        0
BMI                                0
Smoking_History                    0
Alcohol_Consumption                0
Fruit_Consumption                  0
Green_Vegetables_Consumption       0
FriedPotato_Consumption            0
dtype: int64
In [52]:
cardio_df.dropna(inplace=True)
In [53]:
cardio_df.isnull().sum()
Out[53]:
General_Health                  0
Checkup                         0
Exercise                        0
Heart_Disease                   0
Skin_Cancer                     0
Other_Cancer                    0
Depression                      0
Diabetes                        0
Arthritis                       0
Sex                             0
Age_Category                    0
Height_(cm)                     0
Weight_(kg)                     0
BMI                             0
Smoking_History                 0
Alcohol_Consumption             0
Fruit_Consumption               0
Green_Vegetables_Consumption    0
FriedPotato_Consumption         0
dtype: int64
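
One plausible explanation for the 9542 missing Diabetes values (an assumption, since the raw labels of that column are not displayed in this notebook) is that the column contained answers other than "Yes" and "No" before encoding. `Series.map` with a dictionary returns NaN for any label that is not a key, as the sketch below shows on made-up answers; the third label is a hypothetical example.

```python
import pandas as pd

# Hypothetical answers: the third label is an assumed example of a value other than Yes/No.
diabetes = pd.Series(["Yes", "No", "No, pre-diabetes or borderline diabetes", "No"])

mapped = diabetes.map({"Yes": 1, "No": 0})

# Labels that the dictionary did not cover and that therefore became NaN.
unmapped_labels = diabetes[mapped.isna()].unique().tolist()
print(mapped.tolist())
print(unmapped_labels)
```

Listing the unmapped labels before calling `dropna` makes it possible to decide whether those rows should be recoded rather than discarded.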

|OUTLIERS IN THE DATASET¶

In [62]:
# Outliers in dataset
Numeric_Values=['Height_(cm)', 'Weight_(kg)', 'BMI', 'Alcohol_Consumption', 
                  'Fruit_Consumption', 'Green_Vegetables_Consumption', 
                  'FriedPotato_Consumption']

for i in Numeric_Values:
    sns.boxplot(cardio_df[i])
    plt.title('Outliers in ' + i)
    plt.show()

🔍 Interpretation of Results:¶

The summary statistics and boxplots indicate that there are some potential outliers in our numerical data. Here are a few observations:¶

Height_(cm):¶

The minimum value is 91 cm, and the maximum is 241 cm. These could be extreme cases, but they're worth investigating further. 📏

Weight_(kg):¶

The maximum weight is 293.02 kg, which seems quite high. This could potentially be an outlier or extreme value. ⚖️

BMI:¶

The maximum BMI is 99.33, which is very high, even for extreme cases of obesity. This might indicate data entry errors. 🍔

Alcohol_Consumption:¶

The maximum value is 30, which seems quite high. We need to understand the measurement units to interpret whether this is an outlier or not. 🍺

Fruit_Consumption, Green_Vegetables_Consumption, FriedPotato_Consumption:¶

The maximum values seem quite high, but it depends on the measurement units (for example, servings per week/month). 🍎🥦🍟

These potential outliers and extreme values should be investigated further to determine their validity and possible impact on the analysis.
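
Before removing anything, it can help to see which values a boxplot would flag. The sketch below applies the usual 1.5 × IQR fences to a short made-up series; the helper name `iqr_fences` and the sample values are assumptions for illustration, with 99.33 borrowed from the maximum BMI quoted above.

```python
import numpy as np
import pandas as pd

def iqr_fences(values, k=1.5):
    """Return (lower, upper) Tukey fences for a numeric sequence."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical BMI values; 99.33 is the maximum quoted in the text.
bmi = pd.Series([22.0, 24.0, 26.0, 28.0, 30.0, 99.33])

lower, upper = iqr_fences(bmi)
flagged = bmi[(bmi < lower) | (bmi > upper)]   # values a boxplot would draw as points
print(lower, upper, flagged.tolist())
```

Counting the flagged rows per column first shows how much data an IQR-based removal would discard.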

|TREATING THE OUTLIERS¶

In [80]:
def outliers_treatment(column):
    # Tukey fences: values beyond 1.5 * IQR from the quartiles count as outliers
    q1,q3=np.percentile(column,[25,75])
    IQR=q3-q1
    lower_limit= q1 - (1.5*IQR)
    upper_limit = q3 + (1.5*IQR)
    return lower_limit,upper_limit

treatment_list=['Height_(cm)', 'Weight_(kg)', 'BMI', 'Alcohol_Consumption', 
                  'Fruit_Consumption', 'Green_Vegetables_Consumption', 
                  'FriedPotato_Consumption']

for i in treatment_list:
    l,u=outliers_treatment(cardio_df[i])
    # Select the rows that fall outside the fences and drop them
    outlier_rows=cardio_df[(cardio_df[i]<l) | (cardio_df[i]>u)]
    cardio_df.drop(outlier_rows.index,inplace=True)
In [81]:
Numeric_Values=['Height_(cm)', 'Weight_(kg)', 'BMI', 'Alcohol_Consumption', 
                  'Fruit_Consumption', 'Green_Vegetables_Consumption', 
                  'FriedPotato_Consumption']

for i in Numeric_Values:
    sns.boxplot(cardio_df[i])
    plt.title('Outliers in ' + i)
    plt.show()
    
